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Abstract— This paper presents a new methodology for auto- 
matic knowledge driven data mining based on the theory of 
Mercer Kernels, which are highly nonlinear symmetric positive 
definite mappings from the original image space to a very 
high, possibly infinite dimensional feature space. We describe 
a new method called Mixture Density Mercer Kernels to learn 
kernel function directly from data, rather than using pre-delined 
kernels. These data adaptive kernels can encode prior knowledge 
in the kernel using a Bayesian formulation, thus allowing for 
physical information to be encoded in the model. Specifically, we 
demonstrate the use of the algorithm in situations w ith extremely 
small samples of data. We compare the results with existing 
algorithms on data from the Sloan Digital Sky Survey (SDSS) and 
demonstrate the method’s superior performance against standard 
methods. The code for these experiments has been generated with 
the AUTOBaYES tool, which automatically generates efficient and 
documented C/C++ code from abstract statistical model specifica- 
tions. The core of the system is a schema library which contains 
templates for learning and knowledge discovery algorithms like 
different versions of EM, or numeric optimization methods 
like conjugate gradient methods. The template instantiation is 
supported by symbolic-algebraic computations, which allows 
AUTOBaYES to find closed-form solutions and, where possible, to 
integrate them into the code. The results show 7 that the Mixture 
Density Msrcsr Kernel described hers outperforms tree-based 
classification in distinguishing high-redshift galaxies from low- 
redshift galaxies by approximately 16% on test data, bagged 
trees by approximately 7%, and bagged trees built on a much 
larger sample of data by approximately 2%. 

I. Introduction 

There is a growing interest in the machine learning and data 
mining communities in the field of Mercer Kernels due to their 
mathematical properties as well as their use in Support Vector 
Classifiers and Regressors. The theory of Mercer Kernels 
allows data which may be embedded in a vector space, such as 
spectral lines, physical measurements, stock market indices, or 
may not arise from a vector space, such as sequences, graphs, 
and trees to' be treated using similar mathematics. Work by 
Haussler [13] shows how to map sequences of symbols into 
a feature space using kernel similarity measures. In the same 
paper, Haussler introduced the idea of a Probabilistic Kernel 
function, or P-Kernel that obeys Mercer's conditions (i.e., the 
kernel function must be symmetric and positive definite) and 
is defined as follows: 

(1) 


This kernel function says that two points are similar if they are 
both more likely given the model ©.Thus, data points lying in 
lZ n which may be far away from each other in the Euclidean 
sense may turn out to be 'similar' as measured by this kernel. 
We generalize this notion of similarity using Mixture Density 
Mercer Kernels. 

In a recent paper [18], the notion of Mixture Density Mercer 
Kernels was introduced. The idea is to express the distribution 
function P(x t ) in terms of a full Bayesian formulation of 
a density function. The kernel function is created by taking 
bootstrap aggregate samples models based on the distribution 
function. Thus, for one bootstrap sample, we have: 

c 

P(x t |©) = y>(c)P( Xi |0 c ) (2) 

C= 1 

Due to the Bayesian formulation, prior distributions can be 
placed on the model parameters for each bootstrap sample. 
This allows us to encode domain knowledge into each mode). 
The kernel function is then composed of the sum of the outer 
products of the class membership matrices. Thus, we have: 

K(x i; Xj) = $ r (x*)$(xj) 

1 m c m 

~ A n (c m |Xj)P m (c m |xy) (3) 

77 T —1 Ojyx — 1 

where K represents the a sum of M bootstrap samples. $(x*) 
is a composite class membership function, where each member 
of the composite is the posterior class distribution for a model. 
Thus, for M models, we have: 

$(xi) a [Pi(c= l|x t -),P 1 (c= 2|x*),..., 

Pi{c = C\xi),P 2 {c= l|xj),.. .,P M (c= C|xi)] 

In the hard clustering case, where the posterior class distribu- 
tion for a given model is a zero-one vector, the (i.j) element 
of the Mixture Density Mercer Kernel describes how many 
times, on average, the M models agreed that data points x t 
and Xj arose from the same mode in the density function. 

As is the case with ensemble methods, the greater the 
variability in these models, the better the performance of the 
overall model. We demonstrate the degree of variability in the 
models due to different initial conditions in terms of variations 


P(*x,Xj) = P(x i |©)P(x,-|0) 
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in the converged likelihood value. This paper elucidates the 
idea and demonstrates its feasibility in working with a large 
astronomical data set known as the Sloan Digital Sky Survey. 

The information required to construct the kernel function 
(i.e., the class membership matrix) can be computed by an 
application of the EM-algorithm [9] on a suitable training set. 
A number of EM- implementations is available (e.g., Autoclass 
[6], [7], EMMIX [15], MCLUST [11]) and any of them 
could be used. However, in order to insert domain knowledge 
into the kernel matrix, the EM-code has to be modified 
accordingly; this is time-consuming and etTor-prone. More- 
over, since the choice of a particular prior has consequences 
for the quality of the kernel matrix, a certain amount of 
experimentation is necessary. However, the exact form of the 
prior can also have substantial consequences on the details of 
the implementation (e.g., the form of the M-step or the internal 
data structures) which magnifies the implementation problem. 
Fortunately, the overall structure of the algorithm remains the 
same and the details can be derived mechanically. Here, we 
have applied AutoBayes to produce the different variants 
of the EM-algorithm. AutoBayes [5], [10], [12] is a fully 
automatic program synthesis system that generates efficient 
and documented C/C++ code from abstract statistical model 
specifications. 

II. Notation 

• D is the dimension of the data, N is the number of data 
points x 2 drawn from a D dimensional space 

• M is the number of probabilistic models used in 
generating the kernel function. 

• C is the number of mixture components in each 
probabilistic model. In principle one can use a different 
number of mixture components in each model. However, 
here we choose a fixed number for simplicity. 

• x t is a p x 1 dimensional real column vector that 
represents the data sampled from a data set X. 

• <£>(x) : 7Z P T is generally a nonlinear mapping to 
a high, possibly infinite dimensional, feature space F. 
This mapping operator may be explicitly defined or may 
be implicitly defined via a kernel function. 

• A'(xi.Xj) = ^ > (x l )<^ T (xj) € 77 is the kernel function 
that measures the similarity between data points x t and 
Xj. If JT is a Mercer kernel, it can be written as the 
outer product of the map As i and j sweep through 
the N data points, it generates an N x N kernel matrix, 

• p c is the mixture weight for the c th mixture, and q(i,c) 
is the posterior probability of class membership, i.e., 

= P(c\x t ) 


♦ 0 = (p. fx. a) is the entire set of parameters that specify 
a mixture model. 


III. Kernels Built from Ensembles 

This section overviews two methods that we have created 
to build Mercer Kernels directly from data. The first method, 
which we call Mixture Density Mercer Kernels (MDMK) 
uses an ensemble of probabilistic models to create a Mercer 
Kernel. The second method, which we call the Bagged Tree 
Kernel (BTK), uses an ensemble of decision trees to create a 
kernel matrix. We begin with a brief discussion of the MDMK 
followed by a synopsis of the BTK. 

A. Mixture Density Mercer Kernels 

The Mixture Density Mercer Kernel function given in (3) 
is similar to the Cluster-based Similarity Partitioning Algo- 
rithm (CSPA) discussed in [1]. While their implementation 
uses hard clustering, it can be extended to the soft cluster- 
ing (expectation-maximization) approach described here. An 
important difference between this work and previous work, 
however, is that we intend to use our kernel function in 
support vector machines for classification and regression. The 
modularity of SVMs allow different kernels to be implemented 
that model the underlying data generating process in different 
ways. 

The Mixture Density Mercer Kernel is built using an 
ensemble of Bayesian mixture density models. The Bayesian 
formulation allows for prior information to be encoded in the 
model. Then, rather than computing a maximum-likelihood 
estimator, we compute a maximum a posteriori estimator 
which includes the likelihood function and the prior. The 
greater the heterogeneity of the models used in generating the 
kernel, the more effective the procedure. In the AutoBayes 
implementation of the procedure, the training data is sampled 
M times with replacement. These overlapping data sets, com- 
bined with random initial conditions for the EM algorithm, 
aid in generating a heterogenous ensemble. 

In this work, we assume a Gaussian distribution as the 
model for each class, with priors expressed in terms of a 
conjugate prior for the Gaussian distribution. A conjugate prior 
is defined as a family F of probability density functions such 
that for every / e F, the posterior /(© |x) also belongs to F. 

For a mixture of Gaussians model, priors can be set as 
follows [3]. For priors on the means, either a uniform dis- 
tribution or a Gaussian distribution can be used. For priors 
on the covariance matrices, the Wishart density can be used: 
P(Ei|a,/3. J) a exp(-air(E" 1 J)/2). For priors on 

the mixture weights, a Dirichlet distribution can be used: 
P(pi\y) oc w here Pi = P(c = i ). Maximum 

a posteriori estimation is performed by taking the log of the 
posterior likelihood of each data point given the model 0. 
The following function is thus optimized using the Expectation 
Maximization [8] for a Gaussian mixture model with priors on 
the means only. 



For example, the log posterior probability 


log(P(jLt. x | c.cr 2 ) x P(c ] p)) 


for a model with conjugate priors on the means is thus 
computed as follows. The first step is to marginalize (i.e., 
sum out) the latent variable c via the expectation q. However, 
to keep this step tractable, it is important to delay the actual 
summation as long as possible. We thus introduce the “delayed 
summation 5 ' operator which gives us 

logPO.xj c,a 2 ) + Y^ 1 N log P (c I p) 

We can then apply the product rule to decompose the proba- 
bilities and then replace them by the density functions. This 
gives us the formidably looking log-likelihood function 
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Now we can actually execute the delayed summations; the 
crucial step is to replace all occurrences of the latent variable 
inside the body of the YllZ^ci^qi by appropriately re-indexed 
and weighted occurrences. For example, this step simpli- 

fies ElEE log nE p Cj into log nf =1 Qj,iPd ■ After 
further simplifications, we then get the more tractable form: 
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After dropping the terms in the first line (which are inde- 
pendent of the goal variables), and introducing the Lagrange- 
multiplier A we get the following final form of the log- 


likeiihood function. 
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This rather onerous derivation of the likelihood function 
(including the BTjX-code for the displayed formulae!) was 
generated fully automatically by AutoBayes. It is important 
to note that although this likelihood function is much more 
complicated than the likelihood function for a model with no 
priors, it does not change the underlying parametric statistical 
description of the data. Thus, the slightly increased computa- 
tional burden is only seen at the model building stage, but not 
at the model evaluation stage. 

The art of choosing priors is one of much study in Bayesian 
data analysis. As will be seen later in this work, we choose 
priors based on human knowledge about the domain problem 
as well as from various visualizations of the data that indicate 
where modes should be placed. In the problem of classifying 
low and high redshift galaxies, we only include priors on the 
means of the Gaussians, rather than on the mixture weights or 
the covariance matrices. 


B. Bagged Tree Kernels 

The Bagged Tree Kernel uses an ensemble of bagged 
trees to ’vote’ on the most likely class distribution for a 
particular pair of data points x^.xy. Thus, in Equation 3 the 
class distribution P m (c m |x*) is given by a tree function. The 
remaining computations are the same as for the MDMK. We 
introduce this kernel function as a natural extension of an 
ensemble of bagged trees into the support vector machines. 

IV. AutoBayes 

AutoBayes is a fully automatic program synthesis system 
for the data analysis domain. It is implemented in Prolog and 
comprises about 80,000 lines of documented code. From the 
outside, it looks similar to a compiler: it takes an abstract 
problem specification in the form of a statistical model and 
translates it into executable code. On the inside, however, it 
works quite different: AutoBayes first derives a customized 
algorithm implementing the model and then transforms it 
into optimized C/C++ code implementing the algorithm. The 
) algorithm derivation or synthesis process — which distinguishes 
AutoBayes from traditional compilers — relies on the intri- 
cate interplay of three key techniques. (/) AutoBayes uses 
Bayesian networks (BNs) [4], [17] as a compact internal rep- 
resentation of the statistical models. BNs provide an efficient 
encoding of the joint probability distribution over all variables 
and thus enable replacing expensive probabilistic reasoning by 
faster graphical reasoning. In particular, they speed up the de- 
composition of a problem into statistically independent simpler 
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model mog as 'Multivariate Mixture of Gaussians ' ; ,. , ,. .. . . . . . . 

distributed as a discrete distribution with the relative class 


const int D := 5 as 'number of bands' const 
int N as 'number of data points' 
with 1 < N ; 

const int C as 'number of classes' 
with 1 < C ; 
with C << N; 

double phi(l..C) as 'class probabilities' 
with 1 = sum(_i : = 1..C, phi(_i)); 
double mu ( 1 . . D , 1 . . C ) ; double s igma ( 1 . . D , 1 . . C ) ; 

output int c(l..N) as 'latent variable'; c ( ) ~ 

discrete (phi) ; 

data double x ( 1 . . D, 1 . . N ) ; x (_i , _ j ) ~ gauss (mu (_ 
sigma ( i , c ( j ) ) ) ; 

max pr (x | {phi, mu, sigma}) wrt {phi, mu, sigma] 

Fig. 1. AUTOB AYES- specification for Gaussian mixture model. Keywords 
are underlined. 

subproblems, (ii) AutoBay ES uses program, schemas as the 
basic building blocks for the algorithm derivation. Schemas 
consist of a parameterized code fragment or template and a 
set of constraints which are formulated as conditions on BNs. 
The templates can encapsulate advanced algorithms and data 
structures, which lifts the abstraction level of the algorithm 
derivation. The constraints allow the network structure to guide 
the application of the schemas, which prevents a combinatorial 
explosion of the search space. (/«) AutoBayes contains a 
specialized symbolic subsystem which can find closed-form 
solutions for many problems and emerging subproblems. The 
combination of these techniques results in a fast synthesis 
process which compares in speed to the compilation of the 
synthesized code. 

Specification Language. A statistical model describes the 
properties of the data in a fully declarative fashion: for each 
problem variable of interest (i.e., observation or parameter), 
properties and dependencies are specified via probability dis- 
tributions and constraints. Figure 1 shows how the standard 
Gaussian mixture model with diagonal covariance matrices 
can be represented in AutoB ayes’s specification language. 
The model assumes that the data consists of N points in 
D dimensions such that each point belongs to one of C 
classes; the first few lines of the specification just declare 
these symbolic constants and specify the constraints on them. 
However, instead of drawing each point x ( l . . C, _j ) (where 
. . corresponds to Matlab’s : subrange operator, and _i, j 
are index variables) from a multivariate Gaussian c (j ) with 
a full D x D-dimensional covariance matrix, each band _i is 
drawn independently from a univariate Gaussian with mean 
mu ( _i , c ( _j ) ) and standard deviation s igma ( _i , c ( _ j ) ) . 
The unknown distribution parameters can be different for each 
class and each band; hence, we declare them as matrices. The 
unknown assignment of the points to the distributions (i.e., 


frequencies given by the also unknown vector phi. Since 
each point must belong to a class, the sum of the probabilities 
must be equal to one. Finally, we specify the goal inference 
task, maximizing the conditional probability pr (x | {phi , 
mu, sigma}) with respect to the parameters of interest, 
phi, mu, and sigma. This means that we are interested in a 
maximum likelihood estimate (MLE) of the model parameters; 
maximum aposteriori estimates (MAP) can be specified by 
adding priors to the model. Note that the model is completely 
declarative and does not require the user to prescibe any 
i al < g<pritl}rjiic aspects of the estimation program. AUTOBAYES is 
thus “free to select any clustering algorithm that is applicable; 
however, users can force the derivation of specific solutions 
; (e.g. k - means instead of EM) via command line parameters. 

Bayesian Networks. A Bayesian network is a directed, 
acyclic graph whose nodes represent random variables and 
whose edges define probabilistic dependencies between the 
random variables. AUTOBaYES uses hybrid BNs with plates 
[4] to represent the statistical models internally. Hence, nodes 
can represent discrete as well as continuous random variables. 
Plates generalize the concept of independent and identically 
distributed (Lid.) random variables and “collapse” collections 
of independent, co-indexed random variables into graph nodes 
representing the non-repeated core structure; this keeps the 
graphs compact and the graphical reasoning routines (e.g., 
computing the parents, children, or Markov blanket [17] of 
a node) fast. Distribution and dimension information for the 
random variables is attached to the respective nodes and plates. 

Program Schemas. A schema consists of a parameterized 
code fragment (i.e., template) and a set of constraints. The 
parameters are instantiated by AutoBayes, either directly 
or by calling itself recursively with a modified problem. 
The constraints determine whether a schema is applicable 
and how the parameters can be instantiated. Constraints are 
formulated as conditions on the Bayesian network or directly 
on the specified model; they include the maximization goal 
as special case. This allows the network structure to guide the 
application of the schemas and thus to constrain combinatorial 
explosion of the search space, even if a large number of 
schemas is available. Schemas are implemented as Prolog- 
clauses and search control is thus simply relegated to the 
Prolog-interpreter: schemas are tried in their textual order. 
This simple approach has not caused problems so far, mainly 
because the domain admits a natural layering which can be 
used to organize the schema library. The top layer comprises 
network decomposition schemas which try to break down 
the network into independent subnets, based on independence 
theorems for Bayesian networks. These are domain-specific 
divide-and-conquer schemas: the emerging subnets are fed 
back into the synthesis process and the resulting programs 
are composed to achieve a program for the original problem. 
AutoBayes is thus able to automatically synthesize larger 


classes) is represented by the latent variable c; since we are programs by composition of different schemas. The next layer 
interested in the classification results as well (and not only comprises more localized decomposition schemas which work 

the distribution parameters), c is declared as output, c is on products of i.Ld. variables. Their application is also guided 



by the network structure but they require more substantial 
symbolic computations. The core layer of the library contains 
statistical algorithm schemas as for example expectation max- 
imization (EM) [9], [14] and k-Means (i.e., nearest neighbor 
clustering); these generate the skeleton of the program. The 
final layer contains standard numeric optimization methods 
as for example the Nelder-Mead simplex method or different 
conjugate gradient methods. 

Symbolic Subsystem. AutoBayes relies significantly on 
symbolic computations to support schema instantiation and 
code optimization. The core part of the symbolic subsystem 
implements symbolic-algebraic computations, similar to those 
in Mathematica [19]. It is based on the concept of term 
rewriting [2] and uses a small but reasonably efficient rewrite 
engine. Expression simplification and symbolic differentiation 
are implemented as sets of rewrite rules for this rewrite engine. 
The basic rules are straightforward; however, the presence of 
vectors and matrices introduce a few complications and require 
a careful formalization. In addition, AutoBayes contains a 
rewrite system which implements a domain-specific refinement 
of the standard sign abstraction where numbers are not only 
abstracted into pos and neg but also into small (i.e., | x | < 1) 
and large . AutoBayes then uses a relatively simple symbolic 
equation solver built on top of these rewrite systems. 

Backend. The code constructed by schema instantiation 
and composition is represented in an imperative intermediate 
language. This is essentially a “sanitized” subset of C (e.g., 
no pointers), which is extended by a number of domain- 
specific constructs like vector and matrix operations, finite 
sums, and convergence-loops. Since straightforward schema 
application can produce suboptimal code, AutoBayes in- 
terleaves synthesis and advanced code optimization (cf. [16] 
for an overview). Schemas can explicitly trigger aggressive 
large-scale optimizations like code motion, common sub- 
expression elimination and memoization which can take ad- 
vantage of information from the model and the synthesis 
process. Traditional low-level optimizations like constant prop- 
agation or loop fusion, however, are left to the compiler. In a 
final step, AutoBayes translates the intermediate code into 
code tailored for a specific run-time environment. Currently, 
AutoBayes includes code generators for the Octave and 
Matlab environments; it can also produce stand-alone C and 
Modula-2 code. 


Redshifts” can be driven sufficiently low enough we can, 
for the first time, use a sample of order 10 8 . This two 
orders of magnitude improvment could have very significant 
implications for contemporary theories of the Universe. 

SDSS photometry (five broad band filters/colors ugriz) with 
calculated accurate photometric redshifts is our goal. For ex- 
ample, the SDSS will have I0 6 (to date ^ 125, 000 measured) 
galaxy redshifts. The next largest survey has approximately 
220,000. Ail of the rest of the redshifts surveys do not add up 
to that of the 2dFGRS alone. SDSS photometry will eventually 
consist of 10 8 objects (53 x 10 6 currently). Again, no survey 
approaches this quantity of data. Another survey is closest 
with 400 million objects, but only two ’’colors” are measured, 
it is spread across entire sky, is a much shallower survey 
and consists mostly of stars within our own galaxy, rather 
than external galaxies as in the SDSS. The results from the 
latest methods used to attack the Photometric Redshifts in the 
SDSS range from root mean squared errors from 0.034 - 0.066 
and show considerable variability due to sampling. To be able 
to map the filamentary structures in the Universe we need a 
significant improvement in the root mean square error of the 
competing methods. 


Distance Cutoff = 1.25 # Background Galaxies^ 164 
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Fig. 2. Example clustering of one small section of the sky using spectro- 
scopically determined redshifts. Crosses indicate field-galaxies, i.e., those that 
are not on filaments. Dots indicate galaxies that are on filaments. 


V. Experiments and Results 
A. Sloan Digital Sky Survey 

Mapping the large scale structure of the universe is nec- 
essary in order to better constrain formation scenarios of 
structures of all scales (from galaxies to large walls) in the 
universe. To this end, measuring the ’’distances” and x-y 
projections on the sky of the largest number of objects possible 
is necessary. Thus far it has been difficult to use only broad 
band color data to accurately map mass on a broad range 
of scales. Astronomers have only been successful in doing 
this on small numbers of spectroscopically measured galaxies 
(of order 10 5 ). If the errors on what we call ’’Photometric 


B. Choosing Parameters for the Mixture Model 

In a first set of experiments, clustering using AutoBayes 
generated code was used. Our model is a multi-variate mixture 
of Gaussians. Figure 1 shows the entire AutoBayes speci- 
fication. We ran this model and varied the desired number 
of classes from 3 to 30. Because our EM algorithm uses a 
randomized initialization, 10 independent runs were carried 
out. Figure 3 shows the log-likelihood for the given parameters 
after clustering, the solid line shows its mean. From this graph 
it is obvious that the initialization plays an important role as 
it strongly influences the result. From this figure one can also 
deduce that the best number of clusters for the given data set is 




» 


around 12. For larger numbers of clusters, the log-likelihood 
does not change much, indicating that the increased model 
complexity does not appreciably increase the fit of the model. 
For each of the clustering runs, EM needed between 5 and 55 
iterations, with mean of 14.2 iterations to converge. 



Fig. 3. Log-likelihood of a Gaussian mixture model with no priors as a 
function of the number of components in the model. Each box corresponds 
to one run. Notice that there is substantial variation in the terminal value of 
the likelihood function, which is due to the well-known sensitivity of the EM 
algorithm to initial conditions. 

The redshift of a galaxy has a strong connection on how 
its spectral features are mapped onto the 5 spectral bands 
given by the symbols (it.g.r, i, and z). If, for example, a 
significant spectral feature of a near galaxy shows up in band 
r, the corresponding feature of a similar, but distant galaxy 
would be shifted toward the next band, i. Thus, we can assume 
that the data points in the different bands are not uncorrelated 
(as in the previous model), but that they have correlations 
with the neighboring band. This extended model, which has 
a band covariance matrix, includes a simple transformation 
of the data: the new clustering algorithm gets the original 5 
bands, but also the difference signal between adjacent bands: 
u — g, g ~ r,r ~ i,i — z. Figure shows the results of this 
clustering in terms of the likelihood function. The likelihood 
function shows similar variation, and when penalized for the 
additional model complexity, also indicates that the correct 
number of clusters is around 12. 

In the next experiment, a subset of our training data set was 
used. It contained all data points, for which the measured red- 
shift was larger that 0.3. This data set contains 4530 of the 
52744 data points. A similar clustering experiment (with the 
above AutoBayes model) revealed that the best number of 
clusters for distant galaxies is much lower (around 5). Figure 5 
illustrates this. 

C. Incorporating Prior Knowledge 

In order to incorporate prior information in the AutoBayes 
model, we specified conjugate priors on the mean values p. 
The only changes of the specification are the declarations of 
the prior parameters po and their relationship to the mean, 
and a new optimization goal: 



number ot classes 


Fig. 4. Log-likeiihood over number of components (shifted case) 



number of classes 


Fig. 5. Log-likeiihood over number of components; distant galaxies only 


double mu_0 (0 . .D-I , 0 . . C-l) . double kappa_G (0 . . D-l , 
0..C-1). mu(_i,_j) ~ gauss (mu_0 (_i ,_j ) , 

sqrt (sigma (_i , _j ) ) *kappa_0 (_i , _j ) ) 

max pr({mu, x} | { sigma, phi }) wrt (phi, mu, 
sigma} . 

The rest of the specification remains unchanged. AutoBayes 
automatically instantiates the appropriate EM algorithm with 
a highly complex log-likelihood function given in Section 
2. A visual investigation of the data displayed in Figure 6 
indicated that cluster centers need to be placed in spectral 
regions which would model high redshift galaxies. We placed 
5 clusters, based on the results from the clustering of distant 
galaxies only, at the spectroscopic inputs corresponding to 
those galaxies. Those clusters had priors associated with them 
on the r spectral band, since that has maximum correlation 
with the redshift. The remaining input dimensions had no 
priors. Furthermore, we did not place priors on the mixture 
weights or the covariance matrices. Non-isotropic diagonal 
covariance matrices were used in this study. With this model, 
we trained 20 mixture models to build the Mixture Density 
Mercer Kernel. 
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Fig. 6. This figure shows a multivariate scatter piot between the bands 
(u, p, r t i,z) and the redshift. Galaxies which are farther away, i.e., those 
with higher redshift have lower spectral energy in the band, as expected. 
Nearby galaxies have high spectral content. This information was used to set 
priors in subsequent models. 

D. Evaluation of Results 

The Mixture Density Mercer Kernel was built using prob- 
abilistic models that included priors as well as those without 
priors on a training set of 1500 galaxies and a test set of 
5000 galaxies. We first submitted the MDMK along with the 
5000 test galaxies to a single CART decision tree module 
available in Matlab. The resulting confusion matrix indicated 
that only 77% of the distant galaxies (those with a redshift 
greater that 0.3) were classified correctly. Thus, the model had 
a true positive rate of 77%. Using the Mixture Density Mercer 
Kernel this rate was dramatically improved to approximately 
93% using the same training and test data. The tree and 
the MDMK classified approximately 99% and 97% of the 
nearby galaxies correctly. This however, is an easy problem 
since nearby galaxies may have high spectral energy content, 
whereas distant galaxies never have high spectral content. It 
is much more difficult to distinguish far galaxies from those 
that are dim and near. 

The Mixture Density Mercer Kernel performed significantly 
better than the benchmark classifier that we used regardless 
of the use of prior information. It turned out that in this 
application, prior information only improved the results of the 
false negative rate by about 1%, which is within the variation 
due to the model uncertainty. Subsequent research into the 
specific location of the priors and the shape of the covariance 
matrices will be performed. 

In order to further test the quality of the SVM based on 
Mixture Density Mercer Kernels, we built an ensemble of 
20 bagged trees that were built on bootstrap replicates drawn 
from the training set. For this scenario, we found that the true 
positive rate increased from 77% for the single tree to 86%. 
The true negative rate remained the same. However the true 
positive rate, although appreciably higher, was still lower than 


the result for the Mixture Density Mercer Kernel SVM. 

The next experiment that we performed increased the train- 
ing population for the bagged trees from the original scenario, 
where we were drawing bootstrap samples from 1500 points 
to 45,000 points, representing a 30 fold increase in the amount 
of training data. This dramatic increase in training data helped 
the bagged trees true positive rate, bringing it to approximately 
91%, still 2% lower than the result for the MDMK SVM, 
which was built on a data set 30 times smaller in size. Note 
that for all experiments described here, we report the best 
results out of several runs for all models. 

The significant increase in classification accuracy can be 
attributed to the structure induced in the kernel matrix by 
the mixture modelling process. We computed a kernel matrix 
using the procedure outlined in this paper, and evaluated the 
matrix entries using a data set that was sorted in increasing 
order of redshift. The resulting matrix clearly shows that high 
redshift galaxies are generally not confused with lower redshift 
galaxies by the model. There are two notable exceptions in the 
matrix. Confusion would be indicated by large off diagonal 
elements in the matrix. Note that we have displayed the kernel 
matrix in sorted order only for illustrative purposes. The sup- 
port vector machine’s classification accuracy is independent 
of the order in which the data is presented; the underlying 
mathematics is invariant subject to the permutation of the data 
and the corresponding kernel values. 



Fig. 7. This figure illustrates the reason that the Mixture Density Mercer 
Kernel performs so well on the classification task of identifying nearby 
galaxies from those that are far away. Galaxies that are far away are bunched 
together in the lower right hand comer of the matrix. 

We ran the Mixture Density Mercer Kernels in a support 
vector regression machine in order to directly estimate redshift. 
For this problem, regardless of the use of priors, we were 
able to obtain a root mean squared error of approximately 
0.057 on test data, which is comparable to the error rates of 
published methods on large samples. This data shows a great 
deal of sample-to-sample variability. A CART regression tree 
realized a root mean squared error of approximately 0.045, 
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which is about 10% better than the MDMK described here. 
These results, however, are not surprising since the MDMK is 
built to have high performance on classification tasks. 



Fig. 8. Comparison of the true positive and negative rates and computation 
time as a function of the size of the ensemble. The first panel shows the results 
of running a single tree 2000 times. The second panel shows the results of 
2000 runs of 100 bagged trees. The third panel shows the results of an SVM 
built with the RBF kernel. The fourth and fifth panels show the results of 
the an SVM built with the MDMK kernel and the BTK kernel. Notice that 
the true positive rates are higher for the MDMK kernel at the expense of a 
slightly lower true negative rate and higher computation time. 


VI. Conclusions 

Our results indicate that for a difficult, real-world classi- 
fication task, the Mixture Density Mercer Kernel (MDMK) 
performs better 16% better than a decision tree. We have devel- 
oped a method to incorporate prior knowledge into the model 
which is a novel approach to learning kernels directly from 
data. The MDMK with priors was built with AutoBayes, 
which automatically generates code to model mixture densities 
with prior information. The AutoBayES system generates 
code to model the mixture density based on high-level specifi- 
cations, automatically instantiates the associated EM algorithm 
schema, performs all necessary optimizations, and generates 
the symbolic solution along with the likelihood function. 

We plan to further investigate the use of prior information 
in the Mixture Density Mercer Kernel framework on other 
real world and synthetic problems. The dramatic increase in 
classification accuracy that is exhibited here is most likely 
due to the way the kernel function is constructed. The use 
of prior information may prove to be very useful as new 
understandings about the data generating process and the 
associate physics arise. 

The results also indicate that the Mixture Density Mercer 
Kernel can be an excellent representation for classification 
problems using very small samples of data. In resource con- 
strained environments, where CPU, RAM, or other computa- 
tional power is constrained, this kernel may have utility. We 


plan to explore this avenue further to see how the MDMK be- 
haves under constrained conditions. We also plan to generalize 
the MDMK to multiclass problems. 
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